Practical Applications of Locality Sensitive Hashing for Unstructured Data
Abstract
Working with large amounts of unstructured data (e.g., text documents) has become important for many business, engineering, and scientific applications. This article demonstrates how a practicing data scientist can implement a Locality Sensitive Hashing (LSH) system from start to finish in order to drastically reduce the time required to perform a similarity search in a high-dimensional space (e.g., the space created by the terms in the vector space model for documents). Locality Sensitive Hashing reduces the amount of data required for storage and comparison by applying probabilistic dimensionality reduction. In this paper we concentrate on the implementation of min-wise independent permutations (MinHashing), which provide an efficient way to compute an accurate approximation of the Jaccard similarity coefficient between sets (e.g., sets of terms in documents) [2,3].
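As a rough illustration of the MinHashing idea described in the abstract, the following minimal Python sketch estimates the Jaccard similarity of two term sets from short signatures. It is not the article's implementation: the helper names (`term_id`, `make_hash_params`, `minhash_signature`, `estimate_jaccard`), the signature length of 128, and the use of random affine hash functions modulo a prime as a stand-in for true min-wise independent permutations are all assumptions made for this example.

```python
import hashlib
import random

# Assumption: a Mersenne prime used as the modulus of the affine hash family.
_PRIME = (1 << 61) - 1


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def term_id(term):
    """Map a term to a stable 64-bit integer (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod _PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(terms, params):
    """Signature = the minimum hash value over the set, for each hash function."""
    ids = [term_id(t) for t in set(terms)]
    if not ids:
        raise ValueError("cannot build a MinHash signature for an empty set")
    return [min((a * x + b) % _PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where two signatures agree estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical example documents, tokenized by whitespace.
doc_a = "locality sensitive hashing reduces similarity search time".split()
doc_b = "locality sensitive hashing reduces storage and comparison cost".split()

params = make_hash_params(num_hashes=128, seed=42)
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)

print("exact Jaccard:    ", jaccard(doc_a, doc_b))
print("MinHash estimate: ", estimate_jaccard(sig_a, sig_b))
```

Each document is reduced to a fixed-length list of integers regardless of how many terms it contains, which is where the storage and comparison savings come from. Longer signatures should give a tighter estimate at the cost of more hashing work.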
Similar resources
Locality sensitive hashing: A comparison of hash function types and querying mechanisms
It is well known that high-dimensional nearest-neighbor retrieval is very expensive. Dramatic performance gains are obtained using approximate search schemes, such as the popular Locality-Sensitive Hashing (LSH). Several extensions have been proposed to address the limitations of this algorithm, in particular, by choosing more appropriate hash functions to better partition the vector space. All...
LV Barcoding: locality sensitive hashing-based tool for rapid species identification in DNA barcoding
DNA barcoding has emerged as a cost-effective approach for species identification. However, the scarcity of tools used for searching the booming reference database becomes an obstacle, currently with BLAST as the only practical choice. Here, we propose a program LV Barcoding based on both the random hyperplane projection-based locality sensitive hashing method and the composition vector-based V...
Multi-Level Spherical Locality Sensitive Hashing For Approximate Near Neighbors
This paper introduces “Multi-Level Spherical LSH”: a parameter-free, multi-level, data-dependent Locality Sensitive Hashing data structure for solving the Approximate Near Neighbors Problem (ANN). This data structure is a modified version of the multi-probe adaptive querying algorithm, with the potential of achieving an O(np + t) query run time, for all inputs n where t <= n. Keywords—Locality Sensitiv...
Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets
Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data, while keeping low response times. Thus, scalability is imperative for similarity search in Webscale applications, but most existing methods a...
APT: Approximate Period Detection in Time Series
Period detection from time series is an important problem with many real-world applications such as weather forecasting, stock market prediction, electrocardiogram analysis, and periodic disease outbreaks. In this work, we present a novel approximate period detection method for time series. The simplicity of our algorithm and its adaptability for high dimensional datasets using renowned tools and tech...